library(dplyr)
## OR
library(tidyverse)04: Filter and Select
Overview
This tutorial covers two important {dplyr} functions: filter() and select(). Easy to confuse, filter() uses logical assertions to return a subset of rows (cases) in a dataset, while select() returns a subset of the columns (variables) in the dataset.
To remember which does which:
filter()works on rows, which starts with “r”, so it contains the letter “r”.select()works on columns, which starts with “c”, so it contains the letter “c”.
Setup
Packages
We will be focusing on {dplyr} today, which contains both the filter() and select() functions. You can either load {dplyr} alone, or all of {tidyverse} - it won’t make a difference, but you only need one or the other.
Data
Today we’re going to start working with a dataset that we’re going to get familiar with over the next few weeks. Courtesy of fantastic Sussex colleague Jenny Terry, this dataset contains real data about statistics and maths anxiety.
Codebook
There’s quite a bit in this dataset, so you will need to refer to the codebook below for a description of all the variables.
This study explored the difference between maths and statistics anxiety, widely assumed to be different constructs. Participants completed the Statistics Anxiety Rating Scale (STARS) and Maths Anxiety Rating Scale - Revised (R-MARS), as well as modified versions, the STARS-M and R-MARS-S. In the modified versions of the scales, references to statistics and maths were swapped; for example, the STARS item “Studying for an examination in a statistics course” became the STARS-M item “Studying for an examination in a maths course”; and the R-MARS item “Walking into a maths class” because the R-MARS-S item “Walking into a statistics class”.
Participants also completed the State-Trait Inventory for Cognitive and Somatic Anxiety (STICSA). They completed the state anxiety items twice: once before, and once after, answering a set of five MCQ questions. These MCQ questions were either about maths, or about statistics; each participant only saw one of the two MCQ conditions.
For learning purposes, I’ve randomly generated some additional variables to add to the dataset containing info on distribution channel, consent, gender, and age. Especially for the consent variable, don’t worry: all the participants in this dataset did consent to the original study. I’ve simulated and added this variable in later to practice removing participants.
Thinking Like a Coder
From this point onwards, we’re going to start digging into some data wrangling skills. One aspect of this process that is often difficult for people new to coding and/or accustomed to a point-and-click interface is the separation between you and the dataset. That is, you can’t “see” what’s happening as you make changes, or edit your dataset directly; instead, you write code and execute it without being able to observe the changes being made directly. When I was learning R, this felt like a glass wall between me and my dataset that I found very frustrating.
To help with this, we’ll start explicitly practicing a recursive process for any task we want to accomplish in R. It goes something like this:
- Anticipate. Understand clearly what you want your code to accomplish, and have an idea (at least roughly, but preferably in some detail) of what the output will be when the task has been executed correctly.
- Code. Write the code to accomplish the task.
- Evaluate. Compare the output you have actually produced to the output you anticipated producing and judge whether the code has done what it was meant to do or not.
With these three steps, you can ACE any task in R!1
Especially if you are new to coding, I strongly recommend you practice these steps explicitly. For simpler examples or exercises, it might seem trivial. However, this iterative process is absolutely essential to accurate, safe, and efficient coding, and it will become second nature over time.
When I say “safe” code, I don’t mean that your computer will explode. Just as in data wrangling using any application or method, there’s always a risk of making mistakes. When you are coding, however, certain types of mistakes are easier to miss, because the output of your commands is often “hidden” (stored) inside objects. It is definitely possible - and happens often! - that code will run successfully, and even appear to produce the right output, but has in reality not done what you intended it to do. (There’s a great, if complicated, example of this just a bit further down.)
When I say “safe” code, I mean code that is resilient to such mistakes, and has been checked carefully to ensure that it does only and exactly what it is meant to do. In other words, code that you are (at least reasonably!) sure will produce the same results again when you run it again in the future, and that what it does is what you intended. This recursive Anticipate-Code-Evaluate process is the method I use to ensure my code is safe.
Filter
The filter() function’s primary job is to easily and transparently subset the rows within a dataset - in particular, a tibble. The filter() function takes one or more logical assertions and returns only the rows for which the assertion is TRUE. Columns are not affected by filter(), only rows.
General Format
Single Assertions
The logical_assertion in the general format above is just like the assertions we saw in the first tutorial. The rows where the assertion returns TRUE will be included in the output; those that return FALSE will not. Inside the filter() command, use the names of the variable in the piped-in dataset to create the logical assertions.
As a first example, let’s subset our anx_data tibble to only contain people who completed the maths MCQs. We might then anticipate that if we accomplish this task correctly, we will have the same variables in the dataset that we had before, but only the rows where the mcq variable contains the string "maths".
To do this, we’ll of course use filter(), along with a familiar operator from the first tutorial, to write the logical assertion about a variable that exists in the anx_data tibble, namely mcq. Here’s the code:
- 1
-
Take the dataset
anx_data, and then filter it keeping only the cases where the following assertion is true: - 2
-
The value in the
mcqvariable is exactly and only equal to"maths".
Finally, let’s evaluate the output from this command. We do indeed get all the same variables (columns) as before - but the number of rows is smaller than it was previously. Scroll along to the mcq variable, and you will see that each row contains the same value: "maths". So, our output contains only the rows (cases) from the original tibble that have the value "maths" in the mcq variable, and NOT "stats" nor any NAs (because NA does not equal "maths").
I happen to know that this filter() command is correct, so a quick glance through the data is enough of a check. However, what if you weren’t sure this was okay, or if you didn’t want to rely on visual checks? How might you evaluate the output of this command?
Recall that in the anticipate stage, we said that the output should contain only the rows from participants who saw the maths MCQs. So, we should expect that if our filtering has been done correctly, the number of rows of the filtered dataset that we get as output should equal the number of people who saw the maths MCQs.
## Count how many people saw each type of MCQ
anx_data |>
dplyr::count(mcq)
## Same code as above, but passing the output on to nrow() to count the rows
anx_data |>
dplyr::filter(
mcq == "maths"
) |>
nrow()[1] 233
So, we can see visually that the dataset contains the value “maths” in the mcq variable, AND we have checked that the number of rows in the filtered dataset is the same as the number of people who saw maths MCQs. I would be quite confident at this stage that the filtering had been done correctly. Success!
Remember that for exact matches like this, we must use double-equals == and not single-equals =. If you use single equals, you’re not alone - this is such a common thing that the (incredibly friendly and helpful) error message tells you what to do to fix it!
anx_data |>
dplyr::filter(mcq = "maths")Error in `dplyr::filter()`:
! We detected a named input.
ℹ This usually means that you've used `=` instead of `==`.
ℹ Did you mean `mcq == "maths"`?
Naturally, we can also filter on numeric values. For our next example, let’s say we only want to retain participants from the original dataset who were younger than 40 years old. So, we’ll anticipate that the resulting dataset will only contain values in the age variable less than 40, and then write the code to do so:
- 1
-
Take the dataset
anx_data, and then filter it keeping only the cases where the following assertion is true: - 2
-
The value in the
agevariable is less than 40.
To evaluate this code, we can again look through the output and check whether each value in the age variable is less than 40.
This is a bit less easy to check by eye than the first example. If we wanted to do more in-depth checks to ensure our code worked correctly, we could carry on with our pipe and use some of the techniques we covered a couple tutorials ago.
anx_data |>
# Perform the filtering as above
dplyr::filter(
age < 40
) |>
# Get out all the values in the age variable in the output tibble
# Good example of pull() in a pipe ($ wouldn't fly here)
dplyr::pull(age) |>
# Return the max value of those ages
max()[1] 39
Note that this works as a check for the filter() command because this is one continuous pipe. That is, the values in the age variable extracted by pull() are from the filtered dataset, because the input to pull() is the output from filter() in the preceding step.
As an extension of this, instead of using a specific value (like 40), we can also use values that we calculate using the data itself. For instance, let’s only retain participants as old as the median age of the sample, or younger. Here we can take advantage of the fact that we can use variable names as objects inside {dplyr} functions like filter()2.
anx_data |>
dplyr::filter(
age <= median(age, na.rm = TRUE)
)
As a final example, let’s consider a situation where we want to retain only participants that gave a gender identity of either “male” or “female”.3
To do this, we need a new operator: %in%, which God knows I just pronounce as “in” (try saying “percent-in-percent” three times fast!). This looks for any matches with any of the elements that come after it:
- 1
-
Take the dataset
anx_dat, and then filter it keeping only the cases where the following assertion is true: - 2
-
The value in the
gendervariable matches any of the values “female” or “male”.
==?
What follows here is a rabbit hole that gets into some gritty detail. If you’re happy to take my word for it that you absolutely, definitely needed %in% and not == in the previous exercise, you can skip the explanation below. If you’re keen to understand all the nuance, click to expand and read on!
== vs %in%
For this matching task, you might have thought we’d use gender == c("female", "male"), which runs successfully and sure looks okay. So why isn’t this right?
## DO NOT DO THIS
anx_data |>
## THIS DOES NOT DO WHAT WE WANT!!
dplyr::filter(gender == c("female", "male"))## DANGER WILL ROBINSONAt a glance it looks like this produces the same output as the solution above - gender now contains only male or female participants. As you might have gathered from the all-caps comments above - intended to prevent you from accidentally using this code in the future for tasks like this - this is NOT what this code does.
To demonstrate what it does do, I need the dplyr::mutate() function from the next tutorial to create some new variables. The first new variable, double_equals, contains TRUEs and FALSEs for each case using the assertion with ==. The second is exactly the same, but reverses the order of the genders - something that should NOT make a difference to the matching! (We want either female OR male participants, regardless of which we happen to write first.) The third, in_op, contains the same again but this time with %in%. The final arrange() line sorts the dataset by gender to make the output easier to read.
anx_data |>
dplyr::mutate(
double_equals = (gender == c("female", "male")),
double_equals_rev = (gender == c("male", "female")),
in_op = (gender %in% c("female", "male")),
.keep = "used"
) |>
dplyr::arrange(gender)Notice anything wild?
For participants with the same value in gender, the assertions with == both flip between TRUE and FALSE, but in the reverse pattern to each other. The assertion with %in% correctly labels them all as TRUE. WTF?
What’s happening is that because the vector c("female", "male") contains two elements, the assertion with == matches the first case to the first element - female - and returns TRUE. Then it matches the second case to the second element - male - and this time returns FALSE. Then because there are more cases, it repeats: the next (third) case matches female and returns TRUE, the next male and FALSE, and so forth. The == assertion with the gender categories reversed does the same, but starts with male first and female second. Only %in% actually does what we wanted, which was to return TRUE for any case that matches female OR male.
This is a good example of what I think of as “dangerous” code. I don’t mean “reckless” or “irresponsible” - R is just doing exactly what I asked it to do, and it’s not the job of the language or package creators to make sure my code is right. I mean dangerous because it runs as expected, produces (what looks like) the right output, and even with some brief checking, would appear to contain the right cases - but would quietly result in a large chunk of the data being wrongly discarded. If you didn’t know about %in%, or how to carefully double-check your work, you could easily carry on from here and think no more about it.
So, how can we avoid a problem like this? Remember the ACE process described at the start of this tutorial: Anticipate, Code, Evaluate. In this case, the Evaluate step is key to catch that although the == code runs successfully, it has not done what we wanted. Especially for new operators or functions, it’s worth spending a lot of time on evaluating the output and checking what the code has produced in several ways to be sure it’s done what you wanted (and NOT done anything you didn’t want).
There are lots of ways to evaluate the output of code you’ve written, and over time you’ll develop your own checks that make the most sense to you. For now, one option is the code I created above, with new columns for the different assertion options - but this might be something you’d only think to do if you already knew about %in% or suspected there was a problem. A more routine check might stem from the Anticipate step, where I might lay out what I want the output of my code to look like:
I expect that when my filtering is accomplished, my dataset will contain all and only the participants who reported a gender identy of female or male, and no others. I will also have the same number of cases as the original dataset, less the number of other gender categories.
First, I’ll create a new dataset using the filtered data.
## SERIOUSLY THIS IS BAD
anx_data_bd <- anx_data |>
## DON'T USE THIS CODE FOR MATCHING
dplyr::filter(gender == c("female", "male"))
## STOP OH GOD PLEASE JUST DON'TCheck 1: Filtered data contains only male and female participants.
anx_data_bd |>
dplyr::count(gender)Only female and male participants! Tick ✅
At this point, though, I might become suspicious. The original dataset contained 465 cases - we’ve lost more than half! Can that be right? Better check the numbers.
## Get the numbers from the original dataset
anx_data |>
dplyr::count(gender)Uh oh. Already we can see that something’s wrong with the numbers. But instead of relying on visual checks, let’s let R tell us.
## Calculate how many cases we expect if the filtering had gone right
expected_n <- anx_data |>
dplyr::count(gender) |>
## This isn't the best way to filter
dplyr::filter(gender != "non-binary") |>
## The next section on multiple assertions has a much better method!
dplyr::filter(gender != "other/pnts") |>
dplyr::pull(n) |>
sum()
## Ask R whether the expected number of rows is equal to the actual number of rows in the filtered data
expected_n == nrow(anx_data_bd)[1] FALSE
Now we know for sure there’s a problem and can investigate what happened more thoroughly.
As a final stop on this incredibly lengthy detour (are you still here? 👋), you might wonder whether the check above would give me the wrong answer, because I used two filter()s in a row, and the whole point of this goose chase is how to accomplish that exact filtering task. First, this is NOT the way I would do this (as the comments suggest), but I’m really trying to stick to ONLY what we’ve already covered wherever possible. But let’s say I’d tried to do this with the bad == filtering that caused all this faff in the first place.
For this particular case there are four values in gender. If I try gender == c("female", "male") here, this DOES actually work fine - because the categories are in the right order and are a multiple of the length of the dataset 🤦 But at least the numbers still wouldn’t match, which would tell me that something went wrong with filtering the whole dataset.
anx_data |>
dplyr::count(gender) |>
dplyr::filter(gender == c("female", "male"))If I happened to have had the genders the other way round, I would have got an empty tibble, and hopefully that also would have clued me in that there was a problem with the original filtering.
anx_data |>
dplyr::count(gender) |>
dplyr::filter(gender == c("male", "female"))Multiple Assertions
Logical assertions can also be combined to specify exactly the cases you want to retain. The two most important operators are:
&(AND): ReturnsTRUEonly when all its constituent assertions returnTRUE- In other words, only cases that return
TRUEfor all assertions will be retained.
- In other words, only cases that return
|(OR): ReturnsTRUEwhenever any of its constituent assertions returnTRUE- In other words, any cases that return
TRUEfor at least one assertion will be retained.
- In other words, any cases that return
Let’s look at a couple minimal examples to get the hang of these two symbols. For each of these, you can think of the single response R gives as the answer to the questions, “Are ALL of these assertions true?” for AND, and “Is AT LEAST ONE of these assertions true?” for OR.
First, let’s start with a few straightforward logical assertions:
"apple" == "apple"[1] TRUE
23 > 12[1] TRUE
42 == "the answer"[1] FALSE
10 > 50[1] FALSE
Next, let’s look at how they combine.
Two true statements, combined with &, return TRUE, because it is true that all of these assertions are true.
"apple" == "apple" & 23 > 12[1] TRUE
Two true statements, combined with |, also return TRUE, because it true that at least one of these assertions is true.
"apple" == "apple" | 23 > 12[1] TRUE
Two false statements, combined with &, return FALSE, because it is NOT true that all of them are true.
42 == "the answer" & 10 > 50[1] FALSE
Two false statements, combined with |, return FALSE, because it is NOT true that at least one of them is true.
42 == "the answer" | 10 > 50[1] FALSE
One true and one false statement, combined with &, return FALSE, because it is NOT true that all of them are true.
23 > 12 & 42 == "the answer"[1] FALSE
One true and one false statement, combined with |, return TRUE, because it is true that at least one of them is true.
23 > 12 | 42 == "the answer"[1] TRUE
To see how this works, let’s filter anx_data to keep only participants that saw the stats MCQs, OR that scored 3 or higher on the first STARS test subscale item.
This requires two separate statements, combined with | “OR”:
- 1
-
Take the dataset
anx_data, and then - 2
- Filter it keeping only the cases where the following assertion is true:
- 3
-
The value in the
mcqvariable is only and exactly equal to"stats", OR - 4
-
The value in
stars_test1is greater than or equal to 3.
To evaluate this output, have a look at the mcq and stars_test1 variables. The mcq variable will still contain both stats and maths, but where it contains maths, the value in the stars_test1 variable should be 3 or more.
This isn’t particularly thorough as evaluation checks go. I could use dplyr::count() again (handy, isn’t it?), but only because it happens to be the case that stars_test1 is a Likert variable with only whole-number values. If this had been a true numeric variable, this wouldn’t be nearly as helpful or readable.
## Same filtering code as above
anx_data |>
dplyr::filter(
mcq == "stats" |
stars_test1 >= 3
) |>
## Count by both mcq and stars_test1
dplyr::count(
mcq, stars_test1
)In the output we can see that the rows with “stats” MCQs can have any stars_test1 value 1-5, but the rows with “maths” MCQs only have a value 3 or above - just what we wanted.
To actually check this command in a way that would work for any numeric variable, I’ll need some functions that we won’t come across for a few more weeks, which tells me that the minimum value for rows that have “maths” MCQs was 3.
## Same filtering code as above
anx_data |>
dplyr::filter(
mcq == "stats" |
stars_test1 >= 3
) |>
## Group output by mcq variable
dplyr::group_by(mcq) |>
## Produce a summary tibble containing the minimum value of STARS test 1
dplyr::summarise(
min_test = min(stars_test1)
)Exercises
Data Cleaning
Filtering is absolutely invaluable in the process of data cleaning. In order to practice this process, I’ve introduced some messy values into the data, so let’s have a look at a method of cleaning up the dataset and documenting our changes as we go.
Thus far in this tutorial, we’ve been practicing filtering without storing our changes anywhere, but here we want our changes to “stick”. So, we are going to practice assigning the output of our piped commands to the same dataset name as we started with. This will overwrite the dataset with its updated version. Generally when you do data cleaning with this process, it’s a good idea to first draft your code WITHOUT assigning the output, so you can check your code and output carefully. Once you are confident that your code does only and exactly what you wanted it to do, assign the output to the dataset name to “save” your changes.
Pre-Exclusions
For data collected on platforms like Qualtrics, you can frequently test out your study via a preview mode. Responses completed via preview are still recorded in Qualtrics, but labeled as such in a variable typically called “DistributionChannel” or similar. In this dataset, we have a similar variable, distribution, that labels whether the data was recorded in a preview ("preview") or from real participants ("anonymous").
Your method may vary, but I wouldn’t bother to document these cases as “exclusions” because they aren’t real data. I would just drop them from the dataset - but of course make sure to record the code that does so.
- 1
-
Overwrite the dataset
anx_datawith the following output: Take the datasetanx_data, and then - 2
- Filter it keeping only the cases where the following assertion is true:
- 3
-
The value in the
distributionvariable is exactly and only equal to “anonymous”.
We’ve seen this exact code structure before; the only difference from the examples above is that the output at the end of the pipe, including only “anonymous” cases, is assigned back to the same dataset name. In the bit of code anx_data <- anx_data |> ..., the start of the pipe is actually the second anx_data, and the final, updated object is the first.
Recording Exclusions
As a part of complete and transparent reporting, we will want to report all of the reasons we excluded cases from our dataset, along with the number excluded. We can build this counting process into our workflow so that at the end, we have a record of each exclusion along with initial and final numbers.
For each check below, our recording process will have two steps:
- Produce a dataset of the cases you will exclude, and count the number of rows (cases).
- Remove the cases and overwrite the old dataset with the new one.
In my process, I’m going to keep anx_data as the original, “raw” version of the dataset. So, I’ll create a copy in a new dataset object to use while “processing” that I will update as I go. This essentially functions as a save checkpoint; if I make a mistake with anx_dat_proc while conducting my data cleaning, I can restart from this point rather from the very beginning of my code.
anx_data_proc <- anx_dataTo begin, we will count the initial number of cases before any exclusions.
n_initial <- nrow(anx_data_proc)
n_initial[1] 453
Remember that we can use nrow() because there is only one participant per row. If we had long-form data with observations from the same participant across multiple rows, we would have to do something a bit different!
Consent
For many datasets, you would likely have a variable with responses from your participants about informed consent. How you filter this depends on what that variable contains, of course. However, we’ve already seen examples of this kind of operation earlier in this tutorial.
For the first assertion, we capture any responses that don’t match “Yes”, but for the second, we need to use a function from a family we met all the way back in Tutorial 01/02, namely is.na().
You can think of is.na() as a question about whatever is in its brackets: “Is (this) NA?” If the value IS an NA, R will return TRUE; if it’s anything else at all, R will return FALSE. So, to get an accurate count, we need to capture people who either answered something other than “Yes”, or didn’t answer at all.
n_no_consent <- anx_data_proc |>
dplyr::filter(consent != "Yes" | is.na(consent)) |>
nrow()
n_no_consent[1] 33
Then, we remove all participants who did not actively consent and assign the resulting dataset to the same name, overwriting the previous version. As we saw before, the below would discard cases that answered “No” (along with any other value not exactly matching “Yes”) and cases with NAs from people who didn’t answer.
anx_data_proc <- anx_data_proc |>
dplyr::filter(consent == "Yes")With a couple of examples of this process under our belts, it’s time for you to give it a go!
Age
For low-risk ethics applications, you may want to exclude people who reported an age below the age of informed consent (typically 18). This may look like age >= 18 or similar in your dataset. However, it’s also important to check for errors or improbable ages, or to remove any participants that are too old if your study has an upper age limit. In this case, my hypothetical study didn’t have an upper age limit, but I’ll designate any ages as 100 or above as unlikely to be genuine responses.
Missing Values
Finally (for now), just about any study will have to decide how to deal with missing values. The possibilities for your own work are too complex for me to have a guess at here, so for now we’ll only look at how to identify and remove missing values.
Single Variable
Let’s look at a single variable to begin with - for example, sticsa_trait_3. We can confirm that this variable has a/some NAs to consider by counting the unique values:
anx_data |>
dplyr::count(sticsa_trait_3)Now that we know this variable does contain missing values, it’s time to remove them.
Next, we need to actually exclude these cases. This time, we want to retain the inverse of the previous filtering requirement: that is, we only want to keep the cases that are NOT missing a value, the opposite of what we got from is.na(sticsa_trait_3). You may recognise “the inverse” or “not-x” as something we’ve seen before with !=, “not-equals”. For anything that returns TRUE and FALSE, you can get the inverse by putting an ! before it. (Try running !TRUE, for example!)
So, to create my clean anx_data_final dataset, I can use the assertion !is.na(sticsa_trait_3) to keep only the participants who answered this question - who do NOT have a missing value.
Finally, I can store the actual number of usable cases, according to my cleaning requirements, in a final object to use when reporting.
anx_data_final <- anx_data_proc |>
dplyr::filter(
!is.na(sticsa_trait_3)
)
n_final <- nrow(anx_data_final)
n_final[1] 390
Removing NAs is a tricky process, but if you’re sure that you want to drop all cases with missing values in your dataset, there are few helper functions to make this easy. However, this is a pretty major step and should be used with caution! If we didn’t check our data carefully, we could easily end up dropping a bunch of cases we didn’t want to get rid of.
For this, we’re going to leave filter() for a moment at look at a different function, tidyr::drop_na(). This function takes a tibble as input, and returns the same tibble as output, but with any rows that had missing values removed.
For example, if we apply it uncautiously here:
anx_data_proc |>
tidyr::drop_na()Well, there goes all our data! Why has every single row in the dataset been dropped?
This is something we could work out without any R whatsoever, just using the codebook and a bit of View mode to confirm. The Codebook tells us that participants were in one of two independent conditions: "maths" or "stats". Because of the wide format of the data, there are mcq_maths questions that are always NA for people in the statistics-MCQ condition, and vice versa for the mcq_stats questions and people in the maths-MCQ condition. So, every single participant - even those who answered every question - has at least some missing values, and dropping NAs without checking just bins the whole dataset.
If I wanted to check this with R, I’d be hard pressed to do it with only what we’ve covered so far. Using the some extra challenge functions from the next tutorial, though, I’d do this:
anx_data |>
dplyr::mutate(
## Create a new variable containing the number of missing values in each row
number_nas = rowSums(is.na(pick(everything())))
) |>
## Count how many missing values there are
dplyr::count(number_nas)So, there’s at least 5 NAs in every single row, and when we call tidyr::drop_na(), every single row is dropped.
RepRoducibility: Overwriting Data Frames
It may be tempting to save all the changes we have done to the dataset in the object containing the original dataframe, and if you always run your code line by line from the top to the bottom, this will not cause any issues. But, if at any point you want to go back up and rerun some piece of code on data you have already filtered out, the code won’t run or will give you incorrect output.
I (RB) find it best to create a new dataframe when you make any changes, but I also don’t want to end up with too many incrementally different dataframes in my environment. I have found that the best way to go about cleaning data is to make a copy of the dataframe with the raw data into a new object, and apply all of the changes you want to do in a single command/pipeline. That way, all code that uses the original, unprocessed data will continue to work even after I have completed all the tidying.
This is NOT, however, what we have done above, which is instead a “stepwise” method counting exclusions as we go for the purposes of reporting them programmatically. The downside of this stepwise method is that anx_data_proc could contain different cases if you run your code out of order. The best way to avoid this is to never run your code out of order4. Sometimes rerunning bits of code is unavoidable, for example if you need to go back to tweak your data cleaning later on, but it’s really key to get in the habit of rerunning all code when you’ve gone back to make a change upstream.
In the next tutorial, we’ll look at a better option for counting exclusions that would allow us to avoid this stepwise process.
Reporting
This section is optional and demonstrates how to use inline code. It will be most useful if you expect to frequently use Quarto documents for reporting results as well as performing analyses/writing code.
Inline code is a way of using short bits of R code in the text (“in-line”) of a Quarto document. It takes the form `r some_r_code`, and when rendered, the code in some_r_code will be evaluated and printed in that place in the text.
As an example of how handy this can be, here’s how it might look to write up a short paragraph in Quarto describing the exclusions we just performed in this section.
The initial sample consisted of
`r n_initial`cases. We removed`r n_no_consent`cases that did not consent,`r n_too_young`cases that reported an age below the ethical age of consent, and`r n_too_old`cases that reported improbable ages (100 years old or older). Finally, we removed`r n_sticsa_t3_missing`cases with who had not responded to the third trait item on the STICSA.This left us with a final sample of`r n_final`cases.
When you render your document, this should come out as:
The initial sample consisted of 453 cases. We removed 33 cases that did not consent, 22 cases that reported an age below the ethical age of consent, and 5 cases that reported improbable ages (100 years old or older). Finally, we removed 3 cases with who had not responded to the third trait item on the STICSA. This left us with a final sample of 390 cases.
There’s a huge advantage of this, namely ease of change. Imagine you had researchers from labs all over the world join the study and add a huge amount of new data to a massive collaborative dataset. In order to update all your numbers, all you have to do is update your initial anx_data dataset with the new cases, and then re-run all your code as is. Because these objects count whatever is in the data, they will automatically contain and record the correct numbers for the data you put into them5.
There are other advantages too - like confidence that you, a human person who may occasionally make mistakes (sorry, no offence meant!), won’t misread, mistype, or otherwise mistake the numbers, because at no point do you actually type a particular number yourself.
Nifty, eh?
Select
The select() function is probably the most straightforward of the core {dplyr} functions. Its primary job is to easily and transparently subset the columns within a dataset - in particular, a tibble. Rows are not affected by select(), only columns.
General Format
To subset a tibble, use the general format:
1dataset_name |>
2 dplyr::select(
3 variable_to_include,
4 -variable_to_exclude,
5 keep_this_one:through_this_one,
6 new_name = variable_to_rename,
7 variable_number
)- 1
-
Take the dataset
dataset_name, and then - 2
- Select the following variables:
- 3
- The name of a variable to be included in the output. Multiple variables can be selected separated by commas.
- 4
-
The name of a variable to be excluded from the output. Use either an exclamation mark (
!) or a minus sign (-) in front of each variable to exclude. Multiple variables can be dropped, separated by commas with a!(or-) before each. - 5
-
A range of variables to include in the output. All the variables between and including the two named will be selected (or dropped, with
!(drop_this_one:through_this_one)). - 6
-
Include
variable_to_renamein the output, but call itnew_name. - 7
- Include a variable in the output by where it appears in the dataset, numbered left to right. For example, “2” will select the second column in the original dataset.
Columns will appear in the output in the order they are selected in select(), so this function can also be used to reorder columns.
Selecting Directly
The best way to get the hang of this will be to give it a go, so let’s dive on in!
That’s really all there is to it!
…Or is it?6
Using tidyselect
The real power in select(), and in many other {tidyverse} functions, is in a system of helper functions and notations collectively called <tidyselect>. The overall goal of “<tidyselect> semantics” (as you will see it referred to in help documentation) is to make selecting variables easy, efficient, and clear.
At UG level at Sussex, students are not taught about <tidyselect> in core modules. However, <tidyselect> is desperately useful and makes complex data wrangling/cleaning a lot faster and more efficient, especially (for instance) for questionnaires with similarly-named subscales, so would make for a great collaborative activity with supervisors. <tidyselect> is also briefly introduced in an optional final-year workshop on working with Qualtrics data, so some students may have seen it before.
These helper functions can be combined with the selection methods above in any combination. Some very convenient options include:
everything()for all columnsstarts_with(),ends_with(), andcontains()for selecting columns by shared name elementswhere()for selecting with a function, described in the next (optional) section
Open the help documentation by running ?dplyr::select in the Console to see examples of how to use all of the <tidyselect> helper functions.
Rather than list examples of all the helper functions here, it’s best to just try them out for yourself. Since you’ll need to work with shared name elements, you might want to refer to the codebook for an overview of how the variables are named.
Using Functions
The material in this section isn’t covered in the live workshops. It’s included here for reference because it’s extremely useful in real R analysis workflows, but it won’t be essential for any of the workshop tasks.
If you want to skip it, jump down to the next section.
Let’s say we want to generate a summary table of the variables in our dataset. Before we can create our summary, though, we may first want to produce a subset of our dataset that only contains numeric variables.
To do this, we can use the <tidyselect> helper function where(). This helper function lets us use any function that returns TRUE and FALSE to select columns. Essentially, we don’t have to select columns using name or position - we can use any criteria we want, as long as we have (or can create…!) a function that expresses that criteria.
Especially helpful here is the is.*() family of functions in base R. This group of functions all have the same format, where the * is a stand-in for any type of data or object, e.g. is.logical(), is.numeric(), is.factor() etc. (The very useful is.na() that we’ve seen with filter() above is also a member of this family.) These functions work like a question about whatever you put into them - for example, is.numeric(x) can be read as, “Is x numeric data?”
You can quickly find all of the functions in this family by typing is. in a code chunk and pressing Tab.
Putting these two together, we could accomplish the task of selecting only numeric variables as follows:
anx_data |>
dplyr::select(
where(is.numeric)
)This command evaluates each column and determines whether they contain numeric data (TRUE) or not (FALSE), and only returns the columns that return TRUE.
Using Custom Functions
The function in where() that determines which columns to keep doesn’t have to be an existing named function. Another option is to use a “purrr-style lambda” or formula (a phrase you may see in help documentation) to write our own criteria on the spot.
For example, let’s select all of the numeric variables that had a mean of 3 or higher:
anx_data |>
dplyr::select(
where(~is.numeric(.x) & mean(.x, na.rm = TRUE) >= 3)
)Instead of just the name of a function, as we had before, we now have a formula inside where(). This formula has a few key characteristics:
- The
~(apparently pronounced “twiddle”!) at the beginning, which is a shortcut for the longerfunction(x) ...notation for creating functions. - The
.x, which is a placeholder for each of the variables that the function will be applied to.
So, this command can be read: “Take the anx_data dataset and select all the columns where the following is true: the data type is numeric AND the mean value in that column is greater than or equal to 3 (ignoring missing values).”
Exporting
RepRoducibility: Saving Data
After you have completed all your data cleaning, you may want to save the cleaned dataset in a new file so you don’t have to repeat the same steps every time you want to rerun your analysis. It is important to make sure you save the pre-processed data in a file with a new filename and not overwrite the raw data file, so that you don’t accidentally lose data you might need later on.
We’ve previously used readr::read_csv() to import a CSV file into R. The other side of the coin is readr::write_csv(), which writes a dataset/tibble to a CSV file.
This function needs two elements: the name of the object in your Environment to be written to file, and the file path, including the file name, of the CSV file to be created.
readr::write_csv(anx_data_final,
here::here("data/anx_data_final.csv"))Don’t forget to give it a different name than the original, raw dataset, and to include the .csv file extension.
If you run the code above, you should see the new dataset appear in your data folder. If you want to run an analysis with this data, you can read in and work with the final, cleaned dataset instead of having to rerun your cleaning code.
Quick Test
In the very first of these tutorials, we ran a t-test using some vectors of made-up data. We’re going to do the same thing again, but look at how we can use real data in a dataset instead.
There are a lot of options in the t.test() function, which can be used, through different arguments, to run almost any variety of t-test you can think of. Here, the “stats” and “maths” groups are independent groups, so we can mostly go with the defaults. The formula takes the form outcome ~ predictor, where the predictor is the grouping variable.
Note that the output mentions “Welch Two Sample t-test”, which is a version of the test that does not assume equal variances. This is the version that is taught to undergraduates, because we have not at this point introduced the process of assumption testing. If you definitely know that the variances are equal and you definitely want Student’s t-test, you can instead change the default setting.
Finally, we may want to report this t-test result. We could do this manually with inline code, but as a cheeky shortcut we’ll make use of the {report} package from {easystats}, which are a coherent and interconnected family of packages like {tidyverse}. It really is this easy:
mcq_ttest <- t.test(sticsa_post_state_1 ~ mcq,
data = anx_data_final,
var.equal = TRUE)
report::report(mcq_ttest)Effect sizes were labelled following Cohen's (1988) recommendations.
The Two Sample t-test testing the difference of sticsa_post_state_1 by mcq
(mean in group maths = 1.98, mean in group stats = 1.88) suggests that the
effect is positive, statistically not significant, and very small (difference =
0.10, 95% CI [-0.09, 0.29], t(386) = 1.04, p = 0.298; Cohen's d = 0.11, 95% CI
[-0.09, 0.31])
You may not want to use this exact text as it is in your own papers, but it makes a nicely readable bit of output rather than a jumble of numbers.
Whew, that’s it! Well done getting through all that. In the next tutorial, we will look at more {dplyr} powerhouses to round out your data wrangling toolkit.
Footnotes
Sorry 😅↩︎
This incredibly useful property is called “data masking”. If you want to know more, run
vignette("programming")in the Console.↩︎I’m not wild about this example - the experiences of non-binary and other genders are just as important! Unfortunately it’s the only variable in the dataset with the right number of categories.↩︎
JM: This reminds me of the joke: a person goes to their GP and says “Doctor, my leg hurts when I move it this way.” The doctor says, “Well, don’t move it that way then!”↩︎
I’m sure Jenny would tell you there’s a little more to it than that, especially with 12,570 students from 100 universities in 35 countries, collected in 21 languages! But that’s both the dream and the general idea.↩︎
Have you seen the size of this tutorial?? Of course it isn’t!↩︎